Guest lecture, University of Edinburgh, 12.11.2019
John Quinn
This lecture is a case study of how different MLPR techniques can be combined to give the basis of a system to monitor viral crop disease.
from IPython.display import Image, IFrame
IFrame('https://docs.google.com/presentation/d/1RFGHYcsdxg1jdl5ofxFZeD78FfGkTlzj6VsqEm3ZJOY/preview#slide=id.g6b05a563b6_0_0', width=800, height=450)
Read in data on positions of cassava plants, and whether they are categorised as having Cassava Brown Streak Disease (CBSD) or healthy. We just take a subset of the data for now so that the examples run quickly.
import GPy
import mplleaflet
import pandas as pd
%pylab inline
df = pd.read_csv('cbsd_locations.csv')
df = df.loc[::10,:]
plt.plot(df.lon[df.cbsd==0], df.lat[df.cbsd==0], 'yo', markersize=10,
alpha=0.5, markeredgecolor='none')
plt.plot(df.lon[df.cbsd==1], df.lat[df.cbsd==1], 'bo', markersize=10,
alpha=0.5, markeredgecolor='none')
plt.gcf().set_size_inches(10, 10)
# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')
If we wanted to know what we thought the prevalence of CBSD disease was at a longitude/latitude that we didn't sample, how would we calculate it?
Don't be put off by the fact that we're dealing with locations on a map: we're basically just trying to learn a function with 2-dimensional inputs and a single output.
In this case, we could use functions where the output is a class (diseased or healthy) or a number (prevalence of disease) at the corresponding location. In other words, this could be modelled as either a classification or a regression problem. Almost any supervised learning method could be used to give some sort of prediction here!
Defining prediction locations
We can construct a grid of coordinates $f_{*} \in \mathbb{R}^2$ covering the spatial extent we're interested in. To get a disease map, we make predictions at all of these points.
GRID_BOUNDS = (29.571499, 35.000273, -1.47887, 4.234466)
grid_points_per_degree = 10.
grid_height = int((GRID_BOUNDS[1] - GRID_BOUNDS[0]) * grid_points_per_degree)
grid_width = int((GRID_BOUNDS[3] - GRID_BOUNDS[2]) * grid_points_per_degree)
x1 = np.linspace(GRID_BOUNDS[0], GRID_BOUNDS[1], grid_width)
x2 = np.linspace(GRID_BOUNDS[2], GRID_BOUNDS[3], grid_height)
xv, yv = meshgrid(x1, x2)
X_star = np.array([xv.ravel(), yv.ravel()]).transpose()
plt.plot(X_star[:,0], X_star[:,1], 'k+')
# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')
Any classification or regression method could be used here, though how useful they are to this problem depends on what assumptions the method is based on.
import sklearn.neighbors
import sklearn.linear_model
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
#model = sklearn.linear_model.LogisticRegression()
X = df.loc[:, ['lon','lat']]
f = df.cbsd
model.fit(X, f)
f_star = model.predict_proba(X_star)
plt.scatter(X_star[:,0], X_star[:,1], c=f_star[:,1], cmap=plt.cm.plasma_r,
s=20, alpha=.8, edgecolors='none', marker='s')
plt.gcf().set_size_inches(10, 10)
# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')
Logistic regression doesn't seem very useful here. Nearest neighbours looks a bit more plausible.
Disease maps are used to make decisions with real consequences, such as how to allocate scarce resources or whether to destroy crops in a disease frontier area. We should at least know how confident we are in our predictions.
We can model crop disease observation data by assuming the level of disease incidence $f_i$ at a location $\mathbf{x}_i$ is strongly related to the level of disease incidence $f_j$ at location $\mathbf{x}_j$ if these locations are close together.
First we have to say what 'close together' means. The most obvious thing to look at is the squared Euclidean distance:
\begin{equation} \| \mathbf{x}_i - \mathbf{x}_j \|^2 \end{equation}
However the Euclidean distance is in an arbitrary scale. Using our knowledge of the problem, we can define a 'length scale' $\ell$ so that we have some calibration:
\begin{equation} \frac{ \| \mathbf{x}_i - \mathbf{x}_j \|^2 }{\ell^2 } \end{equation}
Now that we've said what 'close together' means, let's look at what 'strongly related' means. We can convert our distance function into a new expression which maps everything into the range 0 to 1:
\begin{equation} k(\mathbf{x}_i, \mathbf{x}_j) = \exp \left( - \frac{ \| \mathbf{x}_i - \mathbf{x}_j \|^2 }{\ell^2 } \right) \end{equation}
If the two points are at the same location, then we get $k(\mathbf{x}_i, \mathbf{x}_j) = 1$, and as they get further apart then it gets closer (but never quite reaches) 0.
We can call $k(\mathbf{x}_i, \mathbf{x}_j)$ our covariance function, and set $\ell$ with knowledge about the infection mechanism of the disease. This gives us a Gaussian process. The further away the locations are, the less they affect each other, and so we have a basis for calculating how confident our predictions are at each point.
# The units are decimal degrees, where 1 degree equals approx 111km at the equator.
kernel = GPy.kern.RBF(2, lengthscale=1)
# Create a simple GP model, which aims to predict CBSD status given location.
model = GPy.models.GPRegression(X, np.array([f]).transpose(),kernel)
# OPTIONAL: optimise the parameters
#model.optimize(messages=True,max_f_eval = 1000)
# Make predictions at each point on the grid
f_star, S = model.predict(X_star)
# It's possible to predict negative values, so only keep the positive part.
f_star[f_star < 0] = 0
marker_scale_factor = .2
plt.scatter(X_star[:, 0], X_star[:, 1], c=f_star[:, 0], s=(marker_scale_factor / (S - 1)),
cmap=plt.cm.plasma_r, alpha=.8, edgecolors='none', marker='s')
plt.gcf().set_size_inches(10, 10)
# Display on interactive map
mplleaflet.display(tiles='cartodb_positron')